Abstract

Caching is an effective mechanism for reducing bandwidth usage and alleviating server load. 
However, the use of caching entails a compromise between content freshness and refresh cost.
Excessive refresh yields a high degree of content freshness at a greater cost in system resources.
Conversely, deficient refresh degrades content freshness but saves resource usage. To
address the freshness-cost problem, we formulate the refresh scheduling problem with a generic 
cost model and use this cost model to determine an optimal refresh frequency that gives the 
best tradeoff between refresh cost and content freshness. We prove the existence and uniqueness 
of an optimal refresh frequency under the assumptions that the arrival of content update is 
Poisson and the age-related cost monotonically increases with decreasing freshness. In addition, 
we provide an analytic comparison of system performance under fixed refresh scheduling and 
random refresh scheduling, showing that with the same average refresh frequency the two refresh
schedulings are mathematically equivalent in terms of the minimum long-run average cost.

1 Introduction

Timely information dissemination is a fundamental driving force behind ever-growing
technology advancement and development. The widespread adoption of Web technologies has made the
Internet a de facto channel for the mass distribution of information. Nowadays, popular web sites
such as www.cnn.com and www.msn.com can receive ten million requests per day [H [10], with a normal
request rate of 12,000 per minute and a peak rate of more than 33,000 per minute during
breaking news. Such high demand imposes a significant overhead on both the serving servers and the
networks surrounding them [7]. A variety of approaches and system architectures have been
introduced to enable efficient content distribution while alleviating system load and bandwidth
consumption.

Web servers and browsers (Web clients) are the fundamental architectural building blocks of the
World Wide Web. A Web client is a requester of data (content) and a Web server is the provider of
data (content). A Web server manages and provides the data source, while Web browsers send requests
to a Web server for specific source data by means of a URL (uniform resource locator). Upon
receipt of a request initiated by a Web client, the Web server processes the request and sends a
response back to the Web client. Figure 1 illustrates the typical data flow between the clients and
the server.

The content resources at servers are autonomous: they are updated independently at various
rates without pushing updates to the clients. As a result, each client has to poll the remote
resources at a Web server periodically in order to detect changes and update its contents. This
process is referred to as refresh synchronization. Freshness comes at a cost in resource usage:
each request initiated by a client incurs a certain communication and computation overhead for
processing the request. As a result, refreshing can be very costly in terms of overall bandwidth
usage and server load when millions of individual clients are considered.

Figure 1: Processing flow between clients and a server

Caching is an effective means of reducing the system load of servers and bandwidth usage. The
idea behind caching is to store recently retrieved copies of remote resources somewhere between the
clients and the remote Web servers. As a result, a request initiated by a Web client can be diverted
to the cached copies, which are much closer to the Web client than the remote Web server in terms
of network distance.

The caching architecture in a representative enterprise environment is illustrated in Figure 2,
wherein a caching server is placed at the entrance of the enterprise network to external networks,
acting as an intermediary between host computers (clients) inside the enterprise network and the
Internet. Each host machine connects to the enterprise's network backbone, and the caching server
sits between that backbone and the Internet. Upon receiving a request originating from
a host inside the enterprise network, the caching server checks whether the corresponding
response has already been cached. If the cached file is present, the caching server returns the
cached copy to the client, saving the client from repeatedly retrieving the same resource
(document) from the remote server. This flow is represented by the dotted line 1 in Figure 2. If the
cached file is absent, the caching server forwards the request to the remote server. After the
caching server has received the response from the remote server, it returns the response to the
client and stores the response locally for subsequent requests. This flow is represented by the
dashed line 2 in Figure 2. The use of a caching server in an enterprise environment enables
substantial bandwidth savings and dramatically improves user-perceived response time, because the
cached copies are located in the vicinity of the clients in terms of network distance. The
performance gain appears to be proportional to the number of users [U HI [6].
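The hit and miss paths described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `CachingServer` and the `fetch_from_origin` callback are hypothetical names introduced here, not part of any system described in this paper, and the sketch deliberately ignores refresh, which is the subject of the following sections.

```python
from typing import Callable, Dict


class CachingServer:
    """Toy forward cache: serve local copies, fetch from the origin on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        # fetch_from_origin is a hypothetical callback standing in for the remote server.
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        if url in self._store:        # flow 1: the cached copy is present
            self.hits += 1
            return self._store[url]
        self.misses += 1              # flow 2: forward the request to the remote server
        body = self._fetch(url)
        self._store[url] = body       # keep the response for subsequent requests
        return body
```

Because nothing in this sketch ever refreshes a stored copy, a cached response stays in use no matter how often the origin changes, which is exactly the freshness problem studied in the rest of the paper.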

The performance gains of caching come at the cost of content freshness. The cached copies
become obsolete as soon as the original copies at the remote server are updated or changed.
There is a substantial tradeoff between content freshness and refresh cost: frequent refresh
ensures the freshness of content at a high refresh cost. Conversely, infrequent refresh degrades
the freshness of content but saves refresh cost.

Figure 2: Placement of caching server at an enterprise environment

Figure 3: Conceptual Flow Diagram of Crawler-based Search Engine

The freshness-cost problem also arises in crawler-based search engine applications. Crawler-based
search engines such as Google provide a powerful tool for searching Web documents, serving
as a "yellow book" on the Web. Figure 3 presents a conceptual flow diagram of crawler-based Web
search engines. Periodically, a Web search engine polls Web servers independently, and Web clients
query the Web search engine at random times, searching its directory of Web documents. Clearly,
the refresh scheduling of the Web search engine is independent of the Web access by Web clients.
In addition, the content at Web servers may change over time, and it is impossible to know a priori
the exact arrival times of content updates. To maintain the freshness of content, the Web search
engine needs to poll Web sites and update its database directory frequently. Such a
process is known as Web crawling. The freshness of content is regarded as one of the important
performance metrics for a crawler-based search engine [8]. Web crawling is a prohibitively expensive
computation task given the astronomical number of constantly changing Web pages. It
is reported that most search engines refresh their entire directory databases once a month [8].

Cho and Garcia-Molina [2], [3] were the first to present a probability model for studying the impact
of various refresh (synchronization) policies on content freshness, with an emphasis on the
synchronization-order policy and in a context different from that of this paper. Our study differs
from theirs principally in that we consider the problem of refresh scheduling, with the objective
of optimally balancing the tradeoff between content freshness and refresh cost.

The remainder of the paper is organized as follows. In Section 2 we consider the problem
of refresh scheduling involving one cache element. We establish a generic cost model
that accounts for both the content freshness and the refresh cost, and prove the mathematical
equivalence between refresh scheduling with a fixed interval and with a random interval in terms
of the overall cost. Section 3 extends the results to cases involving more than
one cache element, with an emphasis on the uniform allocation policy. Section 4 concludes the
paper.

2 Mathematical Formulation and Main Results 

In this section we consider the problem of optimal refresh scheduling involving only one cache 
element. We begin with an introduction of relevant notions and definitions, followed by a cost 
analysis of the relationship between the refresh interval and the freshness of content. Finally we 
identify the optimal refresh frequency that gives the best tradeoff between the content freshness 
and the refresh cost. 

We formally give the notion of the aggregated age function. Our definition is similar in spirit
to the one proposed by Cho and Garcia-Molina [3], [2], but differs in the sense that we take the
"aggregated effect" of content decay into account. We then introduce the age-related cost function
that generalizes the notion of the aggregated age function.

Suppose that the arrival of content updates at a server follows a Poisson process with intensity
rate $\lambda$. Let $\{X_i, i \ge 1\}$ be the interarrival times of the Poisson process. Define
$S_0 = 0$ and $S_n = \sum_{i=1}^{n} X_i$, where $S_i$ represents the time of the $i$th occurrence of
content update at the server. Let

$$N(t) = \sup\{n \ge 0 : S_n \le t\}. \tag{1}$$

$N(t)$ is a random variable that represents the number of arrivals of content updates at the server
in the time interval $(0, t]$. Two closely related but different notions are given as follows.
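As a concrete illustration of this setup, the following sketch draws the update times $S_n$ from exponential interarrival times and evaluates $N(t)$ by counting. The function names `update_times` and `count_updates` are illustrative choices made here, and the sketch assumes only the Python standard library.

```python
import bisect
import random
from typing import List


def update_times(rate: float, horizon: float, rng: random.Random) -> List[float]:
    """Return the arrival times S_1 < S_2 < ... that fall in (0, horizon]."""
    times: List[float] = []
    s = rng.expovariate(rate)        # X_1, exponential with mean 1 / rate
    while s <= horizon:
        times.append(s)
        s += rng.expovariate(rate)   # S_n = S_{n-1} + X_n
    return times


def count_updates(times: List[float], t: float) -> int:
    """N(t): the number of arrivals with S_n <= t (times must be sorted)."""
    return bisect.bisect_right(times, t)
```

For example, `count_updates(update_times(2.0, 50.0, random.Random(0)), 10.0)` gives one realization of $N(10)$ for rate 2, whose expected value is 20.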

Definition 1 Under refresh frequency of $1/T$, the age of the element $e$ with respect to the $i$th
occurrence of content update $S_i$ at time $t \in [0, T)$ is

$$\mathrm{Age}(e, S_i, t) = (t - S_i)\, I_{\{t > S_i\}},$$

where $I_{\{t > S_i\}}$ is an indicator function.

$\mathrm{Age}(e, S_i, t)$ is a measure of the content freshness of the element $e$ with
respect to the $i$th occurrence of content update at the server.
Definition 2 Under refresh frequency of $1/T$, the aggregated age of the element $e$ with respect to
the occurrences of content update at time $t \in [0, T)$ is

$$A(e, t) = \sum_{i=1}^{N(t)} (t - S_i)\, I_{\{t > S_i\}}.$$

$A(e, t)$ is an aggregated age function that reflects the additive effect of multiple content updates
taking place within the interval $[0, t)$. Cho and Garcia-Molina [2], [3] propose the age metric as
a measure for content freshness by only considering the first occurrence of content update, that
is, $\mathrm{Age}(e, S_1, t)$. The major difference between our definition and that of Cho and
Garcia-Molina is that we consider the additive property of content freshness with respect to
multiple content updates.

Figure 4 illustrates the evolution of $A(e, t)$, the aggregated age of the element $e$ with
respect to the occurrences of content update over time.

Figure 4: Evolution of Aggregated Age Function of A(e, t) (time axis marked with the arrivals of content update)
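Definitions 1 and 2 translate directly into code. The sketch below is a minimal illustration with function names chosen here (`age`, `aggregated_age`); it sums over all update times and lets the indicator drop the ones that have not yet occurred, which is equivalent to summing up to $N(t)$.

```python
from typing import Sequence


def age(update_time: float, t: float) -> float:
    """Age(e, S_i, t) = (t - S_i) if t > S_i, else 0."""
    return t - update_time if t > update_time else 0.0


def aggregated_age(update_times: Sequence[float], t: float) -> float:
    """A(e, t): the sum of the ages of all updates that arrived before t."""
    return sum(age(s, t) for s in update_times)
```

With updates at times 1.0 and 2.5, the aggregated age at $t = 3$ is $(3 - 1) + (3 - 2.5) = 2.5$, whereas the age with respect to the first update alone is 2.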

To study the freshness-cost tradeoff, we introduce the age-related cost function, denoted by
$C_a(x)$, which is a nondecreasing and positive function of the age, where the variable $x$ denotes
the age of the element of interest and the subscript $a$ indicates the age of content (see
Definition 1). The notion of age-related cost is a generalization of the notion of age. In
particular, the age function defined in [2] is a special case of the age-related cost function
with $C_a(x) = x$.

The association of cost with freshness in the problem formulation is partly motivated by its
market relevance: many crawler-based search engines such as Google, Inktomi and FAST [8] have
introduced paid inclusion programs that trade the freshness of content and visibility for a payment.
We further denote the cost of a refresh synchronization by $C_r$, where the subscript $r$ indicates
the cost associated with refresh. This cost could be a measure of bandwidth usage or latency or a
financial payment, depending on the choice of performance metric. The following theorem shows
that under mild conditions the optimal refresh interval $T^*$ that minimizes the long-run mean
average cost exists and is unique.
Theorem 1 Suppose that the arrival of updates is a Poisson process with intensity rate $\lambda$. Let
the refresh interval be $T$, and the cost of a refresh be $C_r$. Then the long-run mean average cost
of caching under the refresh frequency of $1/T$ is given as

$$C(T) = \frac{C_r}{T} + \frac{\lambda}{T} \int_0^T C_a(t)\,dt. \tag{2}$$

If the age-related cost function $C_a(t)$ satisfies (1) $C_a(t)$ is differentiable and $C_a'(t) > 0$;
(2) $C_a(\infty) = \infty$, then there exists a unique refresh interval, designated $T^*$, which
minimizes the long-run mean average cost function $C(T)$, and $T^*$ is determined by the equation

$$T\,C_a(T) - \int_0^T C_a(t)\,dt = \frac{C_r}{\lambda}.$$



Proof For a given interval $(0, t]$, the mean average cost over the interval $(0, t]$ can be written as

$$\frac{E(\text{the random cost in interval } (0, t])}{t}. \tag{3}$$

Hence, the long-run mean average cost under the refresh interval $T$ is

$$C(T) = \lim_{t \to \infty} \frac{E(\text{the random cost in interval } (0, t])}{t}, \tag{4}$$

if this limit exists. Denote the random cost on the interval $((k-1)T, kT]$ by $\xi_k(T)$, $k \ge 1$.
Due to the stationary and independent increments of the Poisson process [9], the
$\xi_k(T)$, $k \ge 1$, are iid, so the long-run mean average cost can be written as

$$C(T) = \lim_{t \to \infty} \frac{1}{t}\, E\!\left(\sum_{k=1}^{\lfloor t/T \rfloor} \xi_k(T)\right) = \frac{E(\xi_1(T))}{T}, \tag{5}$$

where $\lfloor x \rfloor$ is the floor function that gives the largest integer less than or equal to $x$.

The random cost $\xi_1(T)$ on the interval $(0, T]$ consists of a refresh cost and the age-related
cost accumulated over the interval $(0, T]$. For the arrival of the $n$th content update at time
$S_n \le T$, the age-related cost, $C_a(T - S_n)$, is a function of the age $T - S_n$. The aggregated
age-related cost in the interval $(0, T]$ is thus expressed as

$$\left(\sum_{n=1}^{N(T)} C_a(T - S_n)\right) I_{\{N(T) > 0\}}.$$

Hence, the random cost $\xi_1(T)$ on the interval $(0, T]$ is given as

$$\xi_1(T) = C_r + \left(\sum_{n=1}^{N(T)} C_a(T - S_n)\right) I_{\{N(T) > 0\}}. \tag{6}$$



Note that

$$\sum_{n=1}^{N(T)} C_a(T - S_n)\, I_{\{N(T) > 0\}} = \sum_{n=1}^{\infty} C_a(T - S_n)\, I_{\{n \le N(T)\}} = \sum_{n=1}^{\infty} C_a(T - S_n)\, I_{\{S_n \le T\}} \tag{7}$$

and

$$E\left(C_a(T - S_n)\, I_{\{S_n \le T\}}\right) = \int_0^T C_a(T - t)\, f_n(t)\,dt, \tag{8}$$

where $f_n(t)$ is the probability density function of $S_n$, which follows the gamma distribution

$$f_n(t) = \frac{\lambda^n}{(n-1)!}\, t^{n-1} e^{-\lambda t}, \quad t \ge 0. \tag{9}$$

Substituting Eq (9) into Eq (8) we obtain

$$E\left(C_a(T - S_n)\, I_{\{S_n \le T\}}\right) = \int_0^T C_a(T - t)\, \frac{\lambda^n}{(n-1)!}\, t^{n-1} e^{-\lambda t}\,dt. \tag{10}$$


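We then have

$$E\left(\sum_{n=1}^{N(T)} C_a(T - S_n)\, I_{\{N(T) > 0\}}\right) = \sum_{n=1}^{\infty} \int_0^T C_a(T - t)\, \frac{\lambda^n}{(n-1)!}\, t^{n-1} e^{-\lambda t}\,dt = \lambda \int_0^T C_a(T - t)\,dt = \lambda \int_0^T C_a(t)\,dt. \tag{11}$$

Therefore, by Eqs (5), (6), and (11), the long-run mean average cost can be expressed as

$$C(T) = \frac{E(\xi_1(T))}{T} = \frac{C_r}{T} + \frac{\lambda}{T} \int_0^T C_a(t)\,dt.$$

The derivative of $C(T)$ is then given as

$$C'(T) = -\frac{C_r}{T^2} + \frac{\lambda C_a(T)}{T} - \frac{\lambda}{T^2} \int_0^T C_a(t)\,dt.$$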

Define a function $\varphi(T)$ as

$$\varphi(T) = T^2 C'(T) = -C_r + \lambda T C_a(T) - \lambda \int_0^T C_a(t)\,dt. \tag{12}$$

Observe that $C'(T)$ and $\varphi(T)$ have the same sign. It can also be verified that

$$\varphi'(T) = \lambda T C_a'(T) > 0, \quad \forall\, T > 0,$$

since $C_a'(t) > 0$, $\forall\, t > 0$. This shows that the function $\varphi(T)$ strictly increases
in $T > 0$. Observe that $\varphi(0) = -C_r < 0$. Moreover, for any fixed $\epsilon > 0$, if
$T > \epsilon$, then

$$\varphi(T) = -C_r + \lambda \int_0^T (C_a(T) - C_a(t))\,dt \ge -C_r + \lambda \int_0^{\epsilon} (C_a(T) - C_a(t))\,dt \ge -C_r + \lambda\, (C_a(T) - C_a(\epsilon))\, \epsilon. \tag{13}$$
Hence $\varphi(\infty) = \lim_{T \to \infty} \varphi(T) = \infty$ since $C_a(\infty) = \infty$.
Considering the facts that $\varphi(0) < 0$, $\varphi(\infty) = \infty$ and $\varphi(T)$ is strictly
increasing in $T > 0$, we conclude that there must exist a unique $0 < T^* < \infty$ such that

$$C'(T) \begin{cases} < 0, & \text{if } 0 < T < T^* \\ = 0, & \text{if } T = T^* \\ > 0, & \text{if } T > T^*. \end{cases}$$

Therefore, $T^*$ minimizes $C(T)$, or

$$T^* = \arg\min_{T > 0} C(T).$$
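Since $\varphi$ is strictly increasing with $\varphi(0) < 0$, the optimal interval can be located numerically by bracketing and bisection. The following is a minimal sketch under stated assumptions: the integral is approximated with a composite Simpson rule, and the function names (`integrate`, `mean_cost`, `phi`, `optimal_interval`) are illustrative choices made here rather than part of the paper.

```python
from typing import Callable

CostFn = Callable[[float], float]


def integrate(f: CostFn, a: float, b: float, n: int = 2000) -> float:
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3


def mean_cost(T: float, refresh_cost: float, rate: float, age_cost: CostFn) -> float:
    """C(T) = C_r / T + (rate / T) * integral_0^T C_a(t) dt, as in Eq (2)."""
    return refresh_cost / T + rate / T * integrate(age_cost, 0.0, T)


def phi(T: float, refresh_cost: float, rate: float, age_cost: CostFn) -> float:
    """phi(T) = -C_r + rate * T * C_a(T) - rate * integral_0^T C_a(t) dt, as in Eq (12)."""
    return -refresh_cost + rate * (T * age_cost(T) - integrate(age_cost, 0.0, T))


def optimal_interval(refresh_cost: float, rate: float, age_cost: CostFn,
                     tol: float = 1e-10) -> float:
    """Locate the root of phi by doubling the bracket, then bisecting."""
    lo, hi = 0.0, 1.0
    while phi(hi, refresh_cost, rate, age_cost) < 0:   # grow until the sign changes
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if phi(mid, refresh_cost, rate, age_cost) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

As a check, take the quadratic cost $C_a(t) = t^2$ with $C_r = 2$ and $\lambda = 3$. Then $\varphi(T) = -2 + 2T^3$, so $T^* = 1$ and $C(T^*) = 3$, which is what `optimal_interval(2.0, 3.0, lambda t: t * t)` and `mean_cost` reproduce up to numerical tolerance.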



Corollary 1 Suppose that the age-related cost function is $C_a(t) = Ct$, where $C > 0$ is the
proportionality constant. Then the long-run mean average cost function $C(T)$ is given as

$$C(T) = \frac{C_r}{T} + \frac{\lambda C T}{2} \tag{14}$$

and is minimized at

$$T^* = \sqrt{\frac{2 C_r}{\lambda C}}. \tag{15}$$

Also, $C(T^*) = \sqrt{2 \lambda C C_r}$.

Proof Obviously $C_a(t) = Ct$ satisfies the conditions in Theorem 1. Thus, $T^* = \arg\min C(T)$
exists and is uniquely determined by the equation $C'(T) = 0$, where

$$C'(T) = -\frac{C_r}{T^2} + \frac{\lambda C}{2},$$

and the unique solution to $C'(T) = 0$ is given by

$$T^* = \sqrt{\frac{2 C_r}{\lambda C}}.$$
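The closed form for the linear case is easy to evaluate directly. The sketch below uses illustrative function names and made-up parameter values, chosen here only so that the arithmetic comes out in round numbers.

```python
import math


def linear_optimal_interval(refresh_cost: float, rate: float, slope: float) -> float:
    """T* = sqrt(2 * C_r / (rate * C)) for the linear cost C_a(t) = C * t."""
    return math.sqrt(2.0 * refresh_cost / (rate * slope))


def linear_mean_cost(T: float, refresh_cost: float, rate: float, slope: float) -> float:
    """C(T) = C_r / T + rate * C * T / 2, as in Eq (14)."""
    return refresh_cost / T + rate * slope * T / 2.0
```

For instance, with $C_r = 8$, update rate 4 and $C = 1$, the optimal interval is $\sqrt{16/4} = 2$ and the minimum cost is $8/2 + 4 \cdot 2 / 2 = 8 = \sqrt{2 \cdot 4 \cdot 1 \cdot 8}$. Quadrupling the update rate to 16 halves the optimal interval to 1.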



Corollary 1 shows that the optimal refresh interval decreases with an increasing rate of
content update when the age-related cost function is linear. This result also holds in general, as
shown in the following theorem.

Theorem 2 Under the same conditions as in Theorem 1, the optimal refresh interval $T^*$ strictly
decreases in $\lambda > 0$.

Proof Clearly $T^*$ is a function of $\lambda$. For the sake of simplicity, we will not express this
dependence explicitly. For the same reason we suppress the symbol $*$ and simply use $T$ to denote
the optimal refresh interval in the following derivation. From Theorem 1 the optimal refresh
interval satisfies $\varphi(T) = 0$, or

$$-C_r + \lambda T C_a(T) - \lambda \int_0^T C_a(t)\,dt = 0. \tag{16}$$

Taking the derivative with respect to $\lambda$ in Eq (16), we obtain

$$T C_a(T) + \lambda T C_a'(T)\, \frac{dT}{d\lambda} - \int_0^T C_a(t)\,dt = 0. \tag{17}$$

From Eq (17) it follows that

$$\lambda T C_a'(T)\, \frac{dT}{d\lambda} = \int_0^T C_a(t)\,dt - T C_a(T) = \int_0^T (C_a(t) - C_a(T))\,dt < 0. \tag{18}$$

Eq (18) immediately implies

$$\frac{dT^*(\lambda)}{d\lambda} < 0$$

since $C_a'(T) > 0$. This proves that $T^*(\lambda)$ is a strictly decreasing function of $\lambda$.

Theorem 3 Under the same conditions as in Theorem 1, the fixed optimal refresh interval $T^*$ strictly
increases in $C_r > 0$.

The proof is simple and straightforward, and is therefore omitted.

Remark 1: Theorems 2 and 3 demonstrate the impact of the arrival rate of content update as
well as the refresh overhead $C_r$ on the optimal refresh frequency. A high frequency of content
update and a cheap refresh cost call for a high frequency of refresh in order to maintain a certain
level of content freshness. Conversely, infrequent content update and an expensive refresh cost
require that refresh be performed at low frequency in order to save bandwidth usage and mitigate
server load. The analytical results obtained above not only agree well with our intuition, but also
provide a quantitative connection between the optimal refresh interval and the arrival rate of
update.

Theorem 4 Suppose that two age-related cost functions $C_{a1}(t)$ and $C_{a2}(t)$ satisfy the
conditions in Theorem 1. Let $T_1^*$ and $T_2^*$ be the optimal refresh intervals associated with
$C_{a1}(t)$ and $C_{a2}(t)$, respectively. For given $C_r$ and $\lambda$, if
$0 \le \Delta(t) = C_{a2}(t) - C_{a1}(t)$ is nondecreasing in $t \ge 0$, then $T_1^* \ge T_2^*$.

Proof According to the proof of Theorem 1,

$$\varphi(T) = -C_r + \lambda T C_{a1}(T) - \lambda \int_0^T C_{a1}(t)\,dt \tag{19}$$

strictly increases in $T$, and

$$\varphi(T_1^*) = -C_r + \lambda T_1^* C_{a1}(T_1^*) - \lambda \int_0^{T_1^*} C_{a1}(t)\,dt = 0. \tag{20}$$

Suppose, contrary to the claimed result, that $T_1^* < T_2^*$. From the monotonicity of $\varphi(T)$ in
Eq (12) and $\varphi(T_1^*) = 0$, we have $\varphi(T_2^*) > 0$. That is,

$$-C_r + \lambda T_2^* C_{a1}(T_2^*) - \lambda \int_0^{T_2^*} C_{a1}(t)\,dt > 0. \tag{21}$$

Note that

$$\begin{aligned}
& -C_r + \lambda T_2^* C_{a2}(T_2^*) - \lambda \int_0^{T_2^*} C_{a2}(t)\,dt \\
&= -C_r + \lambda T_2^* C_{a1}(T_2^*) + \lambda T_2^* \Delta(T_2^*) - \lambda \int_0^{T_2^*} C_{a1}(t)\,dt - \lambda \int_0^{T_2^*} \Delta(t)\,dt \\
&= \left[-C_r + \lambda T_2^* C_{a1}(T_2^*) - \lambda \int_0^{T_2^*} C_{a1}(t)\,dt\right] + \lambda \left[T_2^* \Delta(T_2^*) - \int_0^{T_2^*} \Delta(t)\,dt\right] \\
&= \varphi(T_2^*) + \lambda \int_0^{T_2^*} (\Delta(T_2^*) - \Delta(t))\,dt \\
&\ge \varphi(T_2^*) > 0,
\end{aligned} \tag{22}$$

where the inequality in Eq (22) comes directly from the assumption that $\Delta(t)$ is nondecreasing
in $t \ge 0$. However, since $T_2^*$ is the optimal refresh interval associated with $C_{a2}(t)$, it
must be true that

$$-C_r + \lambda T_2^* C_{a2}(T_2^*) - \lambda \int_0^{T_2^*} C_{a2}(t)\,dt = 0,$$

which contradicts the inequality (22). Therefore, we must have $T_1^* \ge T_2^*$.

In the previous discussion we assumed that the refresh interval has a fixed length $T$. This
assumption is somewhat unrealistic in practice. In the following, we study the impact of random
refresh scheduling on the long-run mean average cost.

Let $\{Y_i, i \ge 1\}$ be a sequence of iid random variables defined on $(0, \infty)$ following a
certain distribution $H$. The sequence $\{Y_i, i \ge 1\}$, representing the intervals between
successive refreshes under a random refresh scheduling, is assumed to be independent of the Poisson
arrival of content updates at a server. A fixed refresh policy is a special case of a random
refresh policy.

Let $\mathcal{H}$ be the family of all distribution functions on $(0, \infty)$ with finite first
moment. Namely,

$$\mathcal{H} = \left\{H : H \text{ is a CDF on } (0, \infty) \text{ and } \int_0^{\infty} \bar{H}(t)\,dt < \infty\right\},$$

where $\bar{H}(t) = 1 - H(t)$, $\forall\, t \ge 0$.

Theorem 5 Let $C_H$ denote the long-run mean average cost under a random refresh interval $Y$
characterized by a distribution function $H \in \mathcal{H}$, and let $C(T)$ denote the long-run mean
average cost under fixed refresh scheduling with interval $T$. Then

$$\min_{H \in \mathcal{H}} C_H = \min_{T > 0} C(T).$$
Proof Since the sequence $\{Y_i, i \ge 1\}$ of refresh intervals is independent of the Poisson
arrival of content updates, it is easy to see that the random costs over the intervals
$(0, Y_1], (Y_1, Y_1 + Y_2], \ldots$ are iid. Following the same line as the proof of Theorem 1, the
long-run mean average cost is expressed as

$$C_H = \frac{E(\text{random cost over } (0, Y])}{E(Y)},$$

where $Y$ has distribution $H$. Let $\xi(Y)$ be the random cost in the cycle $(0, Y]$. We have

$$E(\xi(Y)) = E\{E[\xi(Y) \mid Y]\} = \int_0^{\infty} E\left(C_r + \sum_{n=1}^{N(y)} C_a(y - S_n)\, I_{\{N(y) > 0\}}\right) dH(y), \tag{23}$$

where $S_n = \sum_{i=1}^{n} X_i$ denotes the time of the $n$th arrival of content update at the server.
Due to the independence of $\{X_i, i \ge 1\}$ and $\{Y_i, i \ge 1\}$, from Eq (23) we further obtain

$$E(\xi(Y)) = \int_0^{\infty} \left(C_r + \int_0^{y} \lambda C_a(t)\,dt\right) dH(y) = C_r + \int_0^{\infty} \left(\int_0^{y} \lambda C_a(t)\,dt\right) dH(y).$$

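Therefore, the long-run expected average cost is expressed as

$$C_H = \frac{C_r}{E(Y)} + \frac{\int_0^{\infty} \left(\int_0^{y} \lambda C_a(t)\,dt\right) dH(y)}{E(Y)}. \tag{24}$$

It is straightforward that

$$\min_{H \in \mathcal{H}} C_H \le \min_{T > 0} C(T) \tag{25}$$

since a fixed refresh scheduling with the interval $T > 0$ is a degenerate random refresh scheduling
with $P(Y = T) = 1$.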

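Now consider any given random refresh scheduling $Y$ with distribution $H \in \mathcal{H}$.
Eq (24) can be reexpressed as

$$C_H = \frac{C_r}{E(Y)} + \frac{\int_0^{\infty} \left(\int_t^{\infty} \lambda C_a(t)\,dH(y)\right) dt}{E(Y)} = \frac{C_r}{E(Y)} + \frac{\lambda \int_0^{\infty} C_a(t)\, \bar{H}(t)\,dt}{E(Y)}. \tag{26}$$

If we choose $T = E(Y) = \mu$, meaning that the fixed refresh interval $T$ equals the mean value of
the random refresh interval $Y$, then according to Eq (2), we have

$$C(T) = \frac{C_r}{\mu} + \frac{\lambda}{\mu} \int_0^{\mu} C_a(t)\,dt. \tag{27}$$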

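Subtracting Eq (27) from Eq (26), we obtain

$$\begin{aligned}
C_H - C(T) &= \frac{\lambda}{\mu} \left\{\int_0^{\infty} C_a(t)\, \bar{H}(t)\,dt - \int_0^{\mu} C_a(t)\,dt\right\} \\
&= \frac{\lambda}{\mu} \left\{\int_0^{\mu} C_a(t)\, \bar{H}(t)\,dt + \int_{\mu}^{\infty} C_a(t)\, \bar{H}(t)\,dt - \int_0^{\mu} C_a(t)\,dt\right\} \\
&= \frac{\lambda}{\mu} \left\{\int_{\mu}^{\infty} C_a(t)\, \bar{H}(t)\,dt - \int_0^{\mu} C_a(t)\, H(t)\,dt\right\} \\
&\ge \frac{\lambda}{\mu} \left\{C_a(\mu) \int_{\mu}^{\infty} \bar{H}(t)\,dt - C_a(\mu) \int_0^{\mu} H(t)\,dt\right\} \\
&= \frac{\lambda C_a(\mu)}{\mu} \left\{\int_{\mu}^{\infty} \bar{H}(t)\,dt - \int_0^{\mu} (1 - \bar{H}(t))\,dt\right\} \\
&= \frac{\lambda C_a(\mu)}{\mu} \left\{\int_0^{\infty} \bar{H}(t)\,dt - \mu\right\} = 0,
\end{aligned} \tag{28}$$

where the inequality holds because $C_a$ is nondecreasing and the last equality uses
$\int_0^{\infty} \bar{H}(t)\,dt = E(Y) = \mu$. Thus we have

$$C_H \ge C(T). \tag{29}$$

Combining Eq (25) and Eq (29), we obtain

$$\min_{H \in \mathcal{H}} C_H = \min_{T > 0} C(T).$$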



Such a mathematical equivalence between fixed refresh and random refresh scheduling offers a
theoretical justification for the flexibility that is needed in real application environments.
Refresh scheduling with a random variable $Y$ over a finite interval $(a, b)$ is more flexible than
fixed refresh scheduling. Let $T > 0$ be a given fixed refresh interval and $Y$ be a positive random
interval following a distribution on the interval $(a, b)$ with mean $E(Y) = T$. From Eq (29) it can
be seen that $C_H \ge C(T)$, and the equality holds if and only if $Y$ is a degenerate random
variable with $P(Y = T) = 1$. Thus, for any given $T > 0$ and $\delta > 0$ satisfying
$T - \delta > 0$, if we denote
$\mathcal{H}^* = \{H : H \text{ is a CDF on } (T - \delta, T + \delta) \text{ and } \int_{T-\delta}^{T+\delta} t\,dH(t) = T\}$,
then we have $C_H > C(T)$ for any non-degenerate $H \in \mathcal{H}^*$. In particular, if
$U(\delta)$ is the uniform distribution on $(T - \delta, T + \delta)$, then
$C_{U(\delta)} > C(T)$. It is also easy to show that $\lim_{\delta \to 0} C_{U(\delta)} = C(T)$.
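The gap between the two schedulings can be checked numerically for a simple case. The sketch below evaluates Eq (26) for a refresh interval that is uniform on $(T - \delta, T + \delta)$, under the additional assumption (made here for illustration only) that the age-related cost is linear, $C_a(t) = Ct$. The function names and the midpoint-rule integration are choices of this sketch.

```python
def fixed_cost(T: float, refresh_cost: float, rate: float, slope: float) -> float:
    """C(T) for the linear cost C_a(t) = slope * t, as in Eq (14)."""
    return refresh_cost / T + rate * slope * T / 2.0


def uniform_random_cost(T: float, delta: float, refresh_cost: float, rate: float,
                        slope: float, n: int = 20000) -> float:
    """C_H from Eq (26) for Y ~ Uniform(T - delta, T + delta) and C_a(t) = slope * t."""

    def survival(t: float) -> float:
        # H-bar(t) = P(Y > t) for the uniform distribution on (T - delta, T + delta)
        if t <= T - delta:
            return 1.0
        if t >= T + delta:
            return 0.0
        return (T + delta - t) / (2.0 * delta)

    upper = T + delta                 # H-bar vanishes beyond this point
    h = upper / n
    # midpoint rule for integral_0^upper C_a(t) * H-bar(t) dt
    integral = sum(slope * (k + 0.5) * h * survival((k + 0.5) * h) for k in range(n)) * h
    return refresh_cost / T + rate * integral / T      # E(Y) = T
```

In this linear case the integral in Eq (26) equals $C\,E(Y^2)/2$ with $E(Y^2) = T^2 + \delta^2/3$, so the gap $C_{U(\delta)} - C(T)$ works out to $\lambda C \delta^2 / (6T)$. It is strictly positive for $\delta > 0$ and vanishes as $\delta \to 0$, in line with the statements above.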



3 Synchronizing Multiple Content Elements 

In this section we extend this approach to the case that involves more than one element with
different content update rates.

Consider a server containing content elements $S = \{e_1, \ldots, e_M\}$. Each element is
updated according to a Poisson process with intensity $\lambda_i$ ($1 \le i \le M$). It is
theoretically appealing to synchronize each content element $e_i$ using its own refresh interval
$T_i$, which is referred to as the non-uniform allocation policy [2], [3]. However, the uniform
allocation policy [2], [3], i.e., synchronizing all content elements with the same refresh interval
$T$, is preferable in practice by an amortized cost analysis. The underlying reason is that each
refresh requires a connection establishment along with bandwidth usage and processing overhead.
The connection overhead (latency), denoted $C_{conn}$, is considered the dominating factor in
determining the refresh cost $C_r$. The advantage of the uniform allocation policy over the
non-uniform allocation policy is its ability to share the connection cost among the content
elements. As a result, the amortized connection cost per element under the uniform allocation
policy is $C_{conn}/M$, which is much cheaper than the cost $C_{conn}$ under the non-uniform
allocation policy.
Let $\lambda_i$ be the intensity rate of the Poisson process describing the content update of the
cache element $e_i$, $1 \le i \le M$. From Theorem 1, the long-run mean average cost of caching over
the entire $S$ is

$$C(T) = \sum_{i=1}^{M} \left[\frac{C_r(\lambda_i)}{T} + \frac{\lambda_i}{T} \int_0^T C_a(t; \lambda_i)\,dt\right],$$

where $C_r(\lambda_i)$ is the cost associated with a refresh of element $e_i$, and
$C_a(t; \lambda_i)$ is the age-related cost function associated with element $e_i$, if the uniform
allocation policy $T$ is applied. Then $C(T)$ can be rewritten as

$$C(T) = M \left[\frac{\bar{C}_r}{T} + \frac{1}{T} \int_0^T \bar{C}_a(t)\,dt\right], \tag{30}$$

where

$$\bar{C}_r = \frac{\sum_{i=1}^{M} C_r(\lambda_i)}{M} \tag{31}$$

and

$$\bar{C}_a(t) = \frac{\sum_{i=1}^{M} \lambda_i C_a(t; \lambda_i)}{M}. \tag{32}$$

If the size of the set of cache elements, $M$, is sufficiently large, then it is convenient to
describe all $\lambda_i$ by a probability density function $g(\lambda)$ on $(0, \infty)$. With this
assumption, the long-run mean average cost of caching the entire set of elements $S$ is given as

$$C(T) = M \left[\frac{\bar{C}_r}{T} + \frac{1}{T} \int_0^T \bar{C}_a(t)\,dt\right], \tag{33}$$

where

$$\bar{C}_r = \int_0^{\infty} C_r(\lambda)\, g(\lambda)\,d\lambda$$

and

$$\bar{C}_a(t) = \int_0^{\infty} \lambda\, C_a(t; \lambda)\, g(\lambda)\,d\lambda.$$

Combining Eq (30) and Eq (33), we see that minimizing $C(T)$ is equivalent to minimizing

$$\frac{\bar{C}_r}{T} + \frac{1}{T} \int_0^T \bar{C}_a(t)\,dt. \tag{34}$$

Since the form of Eq (34) is the same as that of Eq (2), we immediately obtain the following result.

Theorem 6 Suppose that the arrival of updates of each content element $e$ is a Poisson process with
intensity rate $\lambda$, $C_r(\lambda)$ is the cost associated with a refresh, and the refresh
interval is $T$. Then the long-run mean average cost of caching $S$ is given as

$$C(T) = M \left[\frac{\bar{C}_r}{T} + \frac{1}{T} \int_0^T \bar{C}_a(t)\,dt\right],$$

where

$$\bar{C}_r = \int_0^{\infty} C_r(\lambda)\,dG(\lambda),$$

$$\bar{C}_a(t) = \int_0^{\infty} \lambda\, C_a(t; \lambda)\,dG(\lambda),$$

and $G(\lambda)$ is a distribution function on $(0, \infty)$ satisfying
$\int_0^{\infty} \lambda\, C_a(t; \lambda)\,dG(\lambda) < \infty$ for any $t \ge 0$.
The problem of minimizing $C(T)$ is equivalent to minimizing $\bar{C}(T)$ given as

$$\bar{C}(T) = \frac{\bar{C}_r}{T} + \frac{1}{T} \int_0^T \bar{C}_a(t)\,dt. \tag{35}$$

If the age-related cost function $\bar{C}_a(t)$ satisfies (a) $\bar{C}_a'(t) > 0$; (b)
$\bar{C}_a(\infty) = \infty$, then there exists a unique refresh interval, designated $T^*$, which
minimizes $C(T)$ and $\bar{C}(T)$. Moreover, $T^*$ is determined by the equation

$$T\,\bar{C}_a(T) - \int_0^T \bar{C}_a(t)\,dt = \bar{C}_r.$$
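To make the uniform allocation policy concrete, the sketch below works out the common refresh interval for a finite set of elements under an assumption introduced here for illustration: each element has a linear age-related cost $C_a(t; \lambda_i) = c_i t$. In that case the averaged age-related cost is also linear with slope $\frac{1}{M}\sum_i \lambda_i c_i$, and the optimality equation reduces to $T^* = \sqrt{2\bar{C}_r / \text{slope}}$. The function names and sample values are hypothetical.

```python
import math
from typing import Sequence


def uniform_allocation_interval(rates: Sequence[float],
                                refresh_costs: Sequence[float],
                                slopes: Sequence[float]) -> float:
    """Common interval T* for all elements, assuming C_a(t; rate_i) = slope_i * t."""
    m = len(rates)
    mean_refresh = sum(refresh_costs) / m                          # averaged refresh cost
    mean_slope = sum(lam * c for lam, c in zip(rates, slopes)) / m  # slope of averaged age cost
    return math.sqrt(2.0 * mean_refresh / mean_slope)


def total_cost(T: float, rates: Sequence[float], refresh_costs: Sequence[float],
               slopes: Sequence[float]) -> float:
    """Sum over elements of C_r(rate_i) / T + rate_i * slope_i * T / 2."""
    return sum(r / T + lam * c * T / 2.0
               for lam, r, c in zip(rates, refresh_costs, slopes))
```

For two elements with update rates 1 and 3, refresh cost 4 each and unit slopes, the averaged refresh cost is 4 and the averaged slope is 2, so $T^* = \sqrt{8/2} = 2$ and the total cost at $T^*$ is $8/2 + 2 \cdot 2 = 8$.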



4 Conclusion 

There is a fundamental tradeoff between the freshness of content and the overhead of refresh
synchronization. Excessive refresh puts a strain on network bandwidth and on server computation
resources, while deficient refresh sacrifices the required freshness of content. This paper
studies the effect of refresh scheduling on the freshness-cost tradeoff. We formulate
the refresh scheduling problem with a generic cost model that captures the relationship between the
arrival rate of content update, the freshness of content and the cost of refresh synchronization,
and then use this cost model to determine an optimal refresh interval that minimizes the overall
cost involved.

The theoretical results obtained in this paper suggest that the optimal refresh frequency should be
a function of the arrival rate of content update and the refresh cost, implying that the refresh
frequency should follow the arrival rate of content update in order to maintain a certain level of
content freshness. Such a viewpoint is implicitly reflected in the fact that many crawler-based
Web search engines such as Google, Inktomi and FAST have started developing automated tools to
identify the content change rate at some Web sites and to adapt the refresh rate to the actual
change rate of the content sources [8]. This paper gives a quantitative analysis of the
freshness-cost tradeoff and offers theoretical guidance in determining the best tradeoff between
the freshness of content and the cost of refresh synchronization.